Tool to downsample a BAM while retaining reads in low coverage areas. #893

Merged 3 commits on Jan 5, 2023

Conversation

Member

@tfenne tfenne commented Dec 29, 2022

Still needs tests.

@tfenne tfenne self-assigned this Dec 29, 2022

codecov-commenter commented Dec 29, 2022

Codecov Report

Base: 95.66% // Head: 95.65% // Decreases project coverage by -0.00% ⚠️

Coverage data is based on head (78140b8) compared to base (da9ecbc).
Patch coverage: 94.64% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #893      +/-   ##
==========================================
- Coverage   95.66%   95.65%   -0.01%     
==========================================
  Files         125      126       +1     
  Lines        7239     7294      +55     
  Branches      507      487      -20     
==========================================
+ Hits         6925     6977      +52     
- Misses        314      317       +3     
Flag Coverage Δ
unittests 95.65% <94.64%> (-0.01%) ⬇️

Impacted Files Coverage Δ
...ulcrumgenomics/bam/DownsampleAndNormalizeBam.scala 93.33% <93.33%> (ø)
src/main/scala/com/fulcrumgenomics/bam/Bams.scala 96.40% <100.00%> (+0.27%) ⬆️
...fulcrumgenomics/umi/ConsensusCallingIterator.scala 100.00% <100.00%> (ø)

Member

@nh13 nh13 left a comment

Just a few comments, one potential bug.


/** Returns the coverage at the given genomic position, or -1 if the position is not between start:end. */
def apply(i: Int): Int = {
  if (i < start || i > end) -1
Member

Consider either returning 0 (zero coverage) or None (no coverage) instead of -1, which feels very Java-like.

Member Author

I think actually I can just go back to having it fail if the index is out of bounds.
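
The three options discussed in this thread (a sentinel, an `Option`, or failing on out-of-bounds access) can be compared in a minimal sketch. `RegionCoverage`, `increment`, and `get` are hypothetical names invented for illustration; this is not the class from the PR.

```scala
/** Hypothetical stand-in for a per-region coverage tracker; not the PR's implementation. */
final class RegionCoverage(val start: Int, val end: Int) {
  require(start <= end, s"start ($start) must be <= end ($end)")
  private val counts = new Array[Int](end - start + 1)

  /** Adds one unit of coverage to each position of from..to that lies inside start..end. */
  def increment(from: Int, to: Int): Unit = {
    var i    = math.max(from, start)
    val last = math.min(to, end)
    while (i <= last) { counts(i - start) += 1; i += 1 }
  }

  /** Fails loudly when the position is outside start..end instead of returning -1. */
  def apply(i: Int): Int = {
    if (i < start || i > end) throw new IndexOutOfBoundsException(s"Position $i is not within $start-$end")
    counts(i - start)
  }

  /** The Option-returning alternative: None for positions outside the region. */
  def get(i: Int): Option[Int] = if (i < start || i > end) None else Some(counts(i - start))
}
```

Failing in `apply` keeps the return type a plain `Int` for the hot path, while callers that genuinely expect out-of-range queries can opt into `get`.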

*/
def add(t: Template): Boolean = {
  val recs = t.allReads
    .filterNot(_.secondary)
Member

This made me think about supplementary reads. Should those be upgraded to "primary" alignments or be filtered out as well? I'm thinking of a region where we have a large number of both primary and supplementary alignments, and how to pick there.

Member Author

I definitely want to include supplementary reads; they're just part of a split-read alignment and therefore contribute to coverage. For example, imagine an alignment to a circular genome: anything that maps over the breakpoint will have a primary plus a supplementary alignment, and we want to count both.
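
The circular-genome case can be made concrete with a small sketch. `Aln` and `CoverageReads` are hypothetical, simplified types (not fgbio's `SamRecord` or `Template`); the two boolean fields stand in for the SAM flags 0x100 (secondary) and 0x800 (supplementary).

```scala
/** Hypothetical, simplified alignment record; not fgbio's SamRecord. */
final case class Aln(name: String,
                     refName: String,
                     start: Int,
                     end: Int,
                     secondary: Boolean = false,
                     supplementary: Boolean = false)

object CoverageReads {
  /** Keeps primary and supplementary alignments and drops only secondary ones. */
  def contributing(alns: Seq[Aln]): Seq[Aln] = alns.filterNot(_.secondary)
}
```

For a read spanning the origin of a 1000 bp circular contig, `CoverageReads.contributing` would keep both `Aln("q1", "chrM", 950, 1000)` and `Aln("q1", "chrM", 1, 50, supplementary = true)`, so both halves of the split alignment count toward coverage.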

  .sortBy(r => (r.refName, r.start, r.end))
  .iterator
  .map(r => new Interval(r.refName, r.start, r.end))
new IntervalMergerIterator(iter, true, false, false)
Member

Sad we can't call arguments by name, since what are the three booleans here?

Member Author

Agreed. For reference they are:

  • combineAbutting
  • enforceSameStrand
  • concatenateNames
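
One way around positional booleans is a thin Scala-defined function, since Scala methods accept named arguments even where Java-defined constructors do not. The sketch below is self-contained and does not call HTSJDK: `Region` and `IntervalMerging.merge` are hypothetical, and the merge semantics attached to each flag are a simplified assumption rather than HTSJDK's exact behavior.

```scala
import scala.collection.mutable.ArrayBuffer

/** Hypothetical, simplified interval; not htsjdk's Interval. */
final case class Region(refName: String, start: Int, end: Int,
                        positiveStrand: Boolean = true, name: String = "")

object IntervalMerging {
  /** Merges coordinate-sorted regions; parameter names mirror the three flags listed above. */
  def merge(sorted: Iterator[Region],
            combineAbutting: Boolean,
            enforceSameStrand: Boolean,
            concatenateNames: Boolean): Iterator[Region] = {
    val out = ArrayBuffer.empty[Region]
    sorted.foreach { r =>
      val mergeable = out.lastOption.exists { prev =>
        // Assumed semantics: abutting means r starts exactly one base after prev ends.
        val closeEnough = if (combineAbutting) r.start <= prev.end + 1 else r.start <= prev.end
        prev.refName == r.refName && closeEnough &&
          (!enforceSameStrand || prev.positiveStrand == r.positiveStrand)
      }
      if (mergeable) {
        val prev = out.last
        val name = if (concatenateNames) s"${prev.name}|${r.name}" else prev.name
        out(out.length - 1) = prev.copy(end = math.max(prev.end, r.end), name = name)
      } else {
        out += r
      }
    }
    out.iterator
  }
}
```

A call site then documents itself: `IntervalMerging.merge(iter, combineAbutting = true, enforceSameStrand = false, concatenateNames = false)`.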

  .toSeq

val addsCoverage = recs.exists { rec =>
  detector.getOverlaps(rec.asSam).iterator().exists { cov =>
Member

Perhaps it's time to have getOverlaps return an iterator itself?

Member Author

I think perhaps it's time to move on from HTSJDK's OverlapDetector and implement one in scala that suits our needs better :/ But not today.
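
As a rough illustration of what a Scala-native detector returning an `Iterator` might look like, here is a deliberately naive sketch. `Span` and `SimpleOverlapDetector` are hypothetical names, the lookup is a linear scan per contig, and nothing here reflects an actual fgbio or HTSJDK design.

```scala
/** Hypothetical closed, 1-based span on a reference sequence. */
final case class Span(refName: String, start: Int, end: Int) {
  def overlaps(that: Span): Boolean =
    refName == that.refName && start <= that.end && that.start <= end
}

/** Naive detector: groups entries by contig and scans linearly on each query. */
final class SimpleOverlapDetector[A](entries: Seq[(Span, A)]) {
  private val byRef: Map[String, Seq[(Span, A)]] = entries.groupBy(_._1.refName)

  /** Returns a lazy Scala Iterator over the values whose spans overlap the query. */
  def overlaps(query: Span): Iterator[A] =
    byRef.getOrElse(query.refName, Seq.empty).iterator.collect {
      case (span, value) if span.overlaps(query) => value
    }
}
```

Because `overlaps` returns an `Iterator`, a caller can write `detector.overlaps(query).exists(...)` directly, without converting from a Java collection first. A production version would presumably need an interval tree or similar index rather than a linear scan.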

@tfenne tfenne marked this pull request as ready for review January 5, 2023 16:07
@tfenne tfenne merged commit 53c2ae9 into main Jan 5, 2023
@tfenne tfenne deleted the tf_downsample_and_norm branch January 5, 2023 22:35